Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment

Authors

  • Saurabh Bagchi
  • Balaji Srinivasan
  • Keith Whisnant
  • Zbigniew T. Kalbarczyk
  • Ravishankar K. Iyer
Abstract

In this paper, we propose a hierarchical framework for providing fault tolerance to the SIFT layer of a distributed system and for extending it to the applications executing in such an environment. The detection hierarchy is proposed in the context of Chameleon, a software environment that provides adaptive fault tolerance to off-the-shelf software in a COTS environment. We propose a flexible mechanism for combining different levels in the hierarchy and different techniques within a level. We define intra-level and inter-level optimizations that minimize the overhead of detection, and we make these optimizations adaptive to runtime requirements. New approaches to software signatures and to diagnosis through interactive consistency protocols are highlighted. The paper presents results from a detailed simulation of the environment, parameterized with measurements obtained from an early prototype implementation. The results indicate the increase in availability due to the detection framework and help in understanding the trade-offs between overhead and coverage for different combinations of techniques.

Similar resources

Hierarchical Error Detection and Recovery in a Software Implemented Fault Tolerance (SIFT) Environment

A key issue in the design of reliable distributed systems is how to make the entities that provide the reliability properties of the system themselves failure resilient. An application executing in such a system depends on these entities, and hence it is critical to protect not just the application but also the components of the fault tolerance layer, through a variety of error detection...

An Experimental Evaluation of the REE SIFT Environment for Spaceborne Applications

Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. This paper presents an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at th...

Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data

The discussion in this paper focuses on the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. In particular, we address: (1) analysis in the prototype phase using physical fault injection into an actual system. We use an example of fault injection-based evaluation of a software-imple...

The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems

The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach has been developed and is being implemented. The built-in hardware error-detection features, combined with software error-detection techniques, provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for the...

Journal:
  • IEEE Trans. Knowl. Data Eng.

Volume 12

Publication date: 2000